AD-A276  801 


\TION  PAGE 


Fom  Approved 
OBMNo.  OTOMm 


p*rr«600nM.  ,  . 

m  . ^ ..*««MN«uiu(iaiiniormaiion.  OHiTmnu itgifoiio tfiii 

xj>  imjuang  tn*  dufM.  to  Waftdmgion  HotdquvM  Sotvbm.  Ovodorao  tor  MomMUon  OpMi 
VA  222102-4X2.  and  to  the  Othce  at  Maraoemortt  and  Budoet,  Paperwork  Reduction  Proieci  (070i^l8A.  WaehnMon.  I 


1  AGENCY  USE  ONLY  (Leave  Blaik) 


4  TITLE  AND  SUBTITLE 


2.  REPORT  DATE 

November  1993 


REPORT  TYPE  AND  DATES  COVERED 
meiiK»andum 


Convergence  Results  for  the  EM  Approach  to  Mixtures  of  Experts 
Architectures 


6  AUTHOR(S) 

Michael  I.  Jordan  and  Lei  Xu 


7.  PERFORMING  ORGANIZATION  NAME(S)  AND  AOORESS(ES) 

Massachusetts  Institute  of  Technology 
Artificial  Intelligence  Laboratory 
545  Technology  Square 
Cambridge,  Massachusetts  02139 


9.  SPONSORING/MONITORING  AGENCY  NAME(S)  AND  ADDRE 


5.  FUNDING  NUMBERS 

NSF  9217041-ASC 
IRI-9013991 
N00014-90-J-1942 
ECS-92 16531 


8.  PERFORMING  ORGANIZATION 
REPORT  NUMBER 


Office  of  Naval  Research 
Information  Systems 
Arlington,  Virginia  22217 


IV 


AIM  1458 
CBCL  87 


10.  SPONSORING^NITORING 
AGENCY  REPORT  NUMBER 


12a  DISTRIBUTION/AVAILABIUTY  STATEMENT  12b.  DISTRIBUTION  CODE 

DISTRIBUTION  UNLIMITED 


13.  ABSTRACT  (Max/rnum  200  words) 

The  Expectation-Maximization  (EM)  algorithm  is  an  iterative  approach  to  maximum  likelihood  parameter 
estimation.  Jordan  and  Jacobs  (1993)  recently  proposed  an  EM  algorithm  for  the  mixture  of  experts 
architecture  of  Jacobs,  Jordan,  Nowlan  and  Hinton  (1991)  and  the  hierarchical  mixture  of  experts 
architecture  of  Jordan  and  Jacobs  (1992).  They  showed  empirically  that  the  EM  algorithm  for  these 
architectures  yields  significantly  faster  convergence  than  gradient  ascent.  In  the  current  paper  we  provide 
a  theoretical  analysis  of  this  algorithm.  We  show  that  the  algorithm  can  be  regarded  as  a  variable  metric 
algorithm  with  its  searching  direction  having  a  positive  projection  on  the  gradient  of  the  log  likelihood. 
We  also  analyze  the  convergence  of  the  algorithm  and  provide  an  explicit  expression  for  the  convergence 
rate.  In  addition,  we  descril^  an  acceleration  technique  that  yields  a  significant  speedup  in  simulation 
experiments. 


14.  SUBJECT  TERMS  15.  NUMBER  OF  PAGES 

33 

16.  PRICE  CODE 


1 7.  SECURITY  CUSSIFICATION  1 8.  SECURITY  CLASSIFICATION  19.  SECURITY  CLASSIFICATION  20.  UMITATION  OF 

OF  REPORT  OF  THIS  PAGE  OF  ABSTRACT  ABSTRACT 


UNCLASSIRED 


Ik’  w  k’ 


UNCLASSIFIED 


UNCLASSIFIED 


DTIC  Qu/,. 


UNCLASSIFIED 


onn 

PiMOlwd  ty  ANSI  SU.  23B-1S 


:::  3 


MASSACHrSKTTS  iXS'l’m  TK  OF  TKCIIXOIXKO 
AHTIFK’IAL  FXTKLFICFXCF  LABOR ATOHV 


and 


(’FXTFH  FOR  BIOLO(;i('AL  AXl)  OOMRFTATIOXAL  LFARXIXC 
DKRARTMKXT  OF  BRAIX  AXl)  COOXFnX  F  SCIFXCFS 


A.l.  Memo  Xo.  UoS  Xcrn'iiiLer  18.  l‘F)d 

C.B.C'.L.  Memo  .Xo.  87 


Convergence  Results  for  the  EM  Approach  to 
Mixtures  of  Experts  Architectures 


Michael  I.  Jordan  and  Lei  Xn 


Abstract 


The  E.vpectatioii-Maximizatioii  (EM)  algorithm  is  an  iterative  approach  to  maximimi  likelihood  pa¬ 
rameter  estimation.  .Jordan  and  -Jacobs  (iyy:J)  recently  projtosed  an  EM  algorithm  for  the  mixture  of 
experts  architecture  of -Jacobs,  Jordan,  ,Novvlan  and  Hinton  (1091)  and  the  hierarchical  mixture  of  ex¬ 
perts  architecture  of -Jordan  and  Jacobs  (1092),  They  showed  empirically  that  the  EM  algorithm  for 
these  architectures  yields  sigtiificantly  faster  convergeitce  than  gradient  ascent.  In  the  current  pajier 
we  provide  a  theoretical  atialysis  of  this  algorithm.  We  show  that  the  algorithm  can  be  regarded  as  a 
variable  tiietric  algorithm  with  its  searching  direction  having  a  positive  projection  on  the  gradient  of  the 
log  likelihood.  We  also  analyze  the  convergence  of  the  algorithm  .nd  provide  an  explicit  expression  for 
the  cottvergence  rate.  In  addition,  we  describe  an  acceleration  technique  that  yields  a  significant  sj^eedup  . 
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Introduction 


Althouj2;li  lU'Uial  networks  are  <ai)al)le  in  jirinripie  of  represent inp,  eonipk'X  innilinear  functions, 
tlie  time  rerpiired  to  train  a  com|)lex  network  <lctes  not  always  scak'  well  witli  prokknii  si/e 
and  the  solution  obtained  does  not  always  reveal  tin*  struct  me  in  tlie  problem.  .\loreo\’ei'.  it  is 
often  difficnlt  to  e.xpress  prior  knowk><lj2;e  in  the  lanp,nap;e  of  fnlly-connected  neural  networks. 
.-\chieving  better  scaling  behavior,  better  interpretalrility  of  solutions  and  better  ways  ol  incor¬ 
porating  prior  knowledge  may  reipiire  a  more  modidar  airproach  in  which  the  learning  |)rol)lem 
is  decomposed  into  snb-i)rol)lems.  Such  an  approach  has  been  used  with  success  in  the  statistics 
literature  and  the  machine  learning  literature.  wher<’  <lecision-tree  algorithms  such  as  C.AR'l 
and  lUd  and  multivariate  sirline  algorithms  such  as  .\1.\RS  have  running  times  that  can  be 
orders  of  magnitude  faster  than  neural  network  algorithms  and  often  yield  sim|)le.  interpretable 
solutions  (Breiman,  Friedman.  Olshen  tV  Stone.  19S  |;  Frietlman.  1991:  Quinlan.  19S(j). 

general  strategy  for  designing  modular  learning  systems  is  to  treat  the  problem  as  one 
of  combining  multiple  models,  each  of  which  is  defined  over  a  local  region  of  the  input  .space, 
.lacobs.  .Iordan.  Nowlan  and  Hinton  (  1991)  introduced  such  a  strategy  with  their  “mixture  of 
experts"  (ME)  architecture  for  su])ervised  learning.  The  architecture  involves  a  set  of  functirrii 
approximators  (  "expert  networks”)  that  are  combined  by  a  classifiei'  (  "gating  network"  ).  These 
networks  are  trained  simultaneously  so  as  to  split  the  input  space  into  regions  where  particular 
experts  can  spccialue.  .Jordan  and  .Jacobs  ( 1992) extended  this  approach  to  a  recursively-defined 
architecture  in  which  a  tree  of  gating  networks  combine  the  expert  networks  into  successively 
larger  groupings  that  are  defined  over  nested  regions  of  the  input  space.  This  "hierarchical 
mixture  of  experts"  (HME)  architecture  is  closely  related  to  the  decision  tree  and  multivariate 
spline  algorithms. 

The  problem  of  training  a  mixture  of  experts  architecture  can  be  treated  as  a  maximum 
likelihood  estimation  problem.  Both  .Jacobs  et  al.  ( 1991 )  and  .Jordan  and  .Jacobs  ( 1992)  derived 
learning  algorithms  by  computing  the  gradient  of  the  log  likelihood  for  their  respective  archi¬ 
tectures.  Empirical  tests  revealed  that  although  the  gradient  approach  succeeded  in  finding 
reasonable  parameter  values  in  particular  problems,  the  convergence  rate  was  not  significantly 
better  than  that  obtained  by  using  gradient  methods  in  multi-layered  neural  network  archi¬ 
tectures.  The  gradient  approach  did  not  appear  to  take  advantage  of  the  modularity  of  the 
architecture.  .\n  alternative  to  the  gradient  approach  was  proposed  by  .Jordan  and  .Jacobs  (in 
press),  who  introduced  an  Expectation-Maximization  (EM)  algorithm  for  mixture  of  experts 
architectures.  EM  is  a  general  technique  for  maximum  likelihood  estimation  that  can  often 
yield  simple  and  elegant  algorithms  (Baum.  Petrie.  Soules  ik'  Weiss.  1970:  Dempster.  Laird  <k' 
RuJnn.  1977).  For  mixture  of  experts  architectures,  the  EM  algorithm  decouples  the  estimation 
process  in  a  manner  that  fits  well  with  the  modular  structure  of  the  architecture.  .Moreover. 
.Jordan  and  .Jacobs  (in  press)  observed  a  significant  speedup  over  gradient  techniques. 

In  this  paper,  we  provide  further  insight  into  the  EM  approach  to  mixtures  of  experts 
architectures  via  a  set  of  convergence  theorems.  We  stiidy  a  particular  variant  of  the  EM 
algorithm  proposed  by  .Jordan  and  .Jacobs  (in  press)  and  demonstrate  a  relationship  between 
this  algorithm  and  gradient  ascent.  We  also  provide  theorems  on  the  convergence  rate  of  the 
algorithm  and  provide  explicit  formulas  for  the  constants. 

The  remainder  of  the  paper  is  organized  as  follows.  Section  2  introduces  the  ME  model. 
The  EM  algorithm  for  this  architecture  is  derived  and  two  convergence  theorems  are  presented. 
Section  3  presents  an  analogous  derivation  and  a  set  of  convergence  results  for  the  HME  model. 
Section  -1  introduces  two  acceleration  techniques  for  improving  convergence  and  presents  the 
results  of  numerical  experiments.  Section  -o  presents  our  conclusions. 
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Figure  1;  The  mixture  of  experts  architecture.  The  total  output  fi  is  the  weighted  sum  of  the 
expert  network  outputs:  /i  =  +  c/2/i2.  where  the  weights  are  the  gating  network  outputs 

(ji  and  (j2. 

Theoretical  analysis  of  an  EM  algorithm  for  the  mixture  of 
experts  architecture 

Network  learning  based  on  maximum  likelihood  estimation 

We  begin  by  studying  the  non-hierarchical  case.  As  shown  in  Figure  1.  the  mixture  of  experts 
(ME)  architecture  is  comprised  of  1\  expert  networks,  each  of  which  solves  a  function  approx¬ 
imation  problem  over  a  local  region  of  the  input  space.  To  each  exi)ert  network  we  associate 
a  probabilistic  model  that  relates  input  vectors  x  €  R"  to  output  vectors  y  G  R"'  ■  We  denote 
these  probabilistic  models  as  follows 

P(y\x.ej).j  = 

where  the  0,  are  parameter  vectors.  Each  of  these  probability  densities  is  assumed  to  belong 
to  the  exponential  family  of  densities  (.Jordan  A’  .Jacobs,  in  press).  The  j"'  exj)ert  network 
produces  as  output  a  parameter  vector  /li, 

Hj  =  ij(x.0j).  j  ^  1.2.  -./\. 

which  is  the  location  parameter  for  the  j*’’  probability  density.  In  the  current  paper,  as  in 
-Jordan  and  .Jacobs  (in  press),  we  treat  the  ca.se  in  which  the  functioiis  fj  are  linear  in  the 
parameters.  We  extend  our  results  to  the  case  of  experts  that  are  nonlinear  in  the  parameters 
in  the  Appendi.x. 

We  also  assume,  for  .simplicity,  that  the  probability  densities  F(y\x.6i)  are  gaussian.  im¬ 
plying  that  the  location  parameter  is  simply  the  mean.  We  as.sociate  a  covariance  matrix 
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with  t'acli  oxpf'rt  network,  yieldhif^  the  I’ollowiiifj;  |)rol>ahilistic  model  for  ex|)ert  j 


P(y\x.e,)  =  (2- (let  L'j)  '  ox|>{-:^[y  -  f,(x.0/)]^  '[y  -  f,(x.0,)]} 


(1) 


riie  .\IK  architect  lire  also  utilizes  a  auxiliary  network  known  as  a  lutwork.  whose  job 

it  is  to  partition  the  input  sirace  into  regions  corresponding  to  the  various  expert  mU works.  I  his 
is  done  by  assigiung  a  irrobability  vector  [</i - t/i- ■  ■  • -//a to  each  point  in  the  input  space.  In 
particular,  the  gating  net  implements  a  parameterized  function  s  :  R'  —  R^'^  and  a  normalizing 
function  g  :  —  R^'"  such  that 


fJj  = 


(■‘‘j 


I.  -  -.  A'. 


(2) 


which  satisfies 


A 

^(y,(x.0,,)  =  I.  for  any  x.6,^. 

./=! 


(d) 


In  the  current  paper  we  focus  on  the  case  in  which  tin*  function  s  is  linear  (cf.  .Iordan  A’  .lacobs. 
in  press).  In  this  case  the  boundaries  </,  =  ry,/  are  planar  and  the  function  g  can  be  viewed  as  a 
smoothed  piecewise-planar  iiartitioning  of  t he  input  space. 

Training  data  are  assumed  to  be  generated  according  to  the  following  probability  model. 
We  assume  that  for  a  given  x.  a  label  j  is  selected  with  probability  P{j\x)  =  (jiix.d,/).  .\n 
output  y  is  then  choseti  with  probability  P(y\x.6j).  Thus  the  total  probability  of  observing  y 
from  X  is  given  by  the  following  finite  tuixtiire  density 


l< 


P{y\x)  =  Y,  P{j\y.)P(y\x.0j)  =  J^i/,(x.ay)P(y|x.0,; 


,/=i 


./  =  ! 


(d) 


.\  trainitig  set  T  =  {(x***.y*'*).  /  =  I.  •  •  • .  .V}  is  assutiied  to  be  generated  as  an  independent  set 
of  draws  from  this  mixture  density.  Thus  the  total  probability  of  the  training  set.  for  a  specified 
set  of  input  vectors  is  given  by  the  following  likelihood  function 

A  =  -  l[P{y^V^)  (0) 

1=1 
A  l< 

t=ij=i 

The  learning  algorithms  that  we  discuss  are  all  maximum  likelihood  estimators.  That  is. 
we  treat  learning  as  the  problem  of  finding  parameters  6g.  6 j.  and  to  maximize  L.  or.  more 
conveniently,  to  maximize  the  log  likelihood  /  =  In  A 

A  A 

l{0.y)  =  ^ln^fy,(x‘'».6>^)P(y<'>|x''>.6>,). 

1=1  .,=1 

where  0  =  [0^.  6>, .  6>i .  •  •  • .  6>a  .  - 1:2- •  ■  •  •  ^A F - 

Given  the  probability  model  in  Eq.  1.  the  expected  value  of  the  output  is  given  as  follows 

A 

H  =  Tiyix]  =  ^(/,(x.0^)/x,. 

7^1 


This  motivates  using;  tlie  weiglittnl  output  of  the  exjx'it  networks  as  t  lu"  total  outi)ut  of  tlu'  MT 
arcliitecture  (cf.  Figure  1). 

The  model  in  Fqs.  (1)  and  ( 1 )  is  a  linit<>  gaussian  mixture  model.  It  i.s  interesting  to 
compare  this  model  to  a  relatc'd  gaiis.sian  mixture  mode)  lliat  is  widely  studied  in  statistics; 
i.c'..  the  model 

l<  I\ 

/>(x)  =  o,  >  0.  ^o,  =  1.  (7) 

./  =  !  ./  =  ! 

File  difference  between  these  models  is  clear:  the  o,‘s  in  F^cp  (  /  )  are  independent  of  the  input 
vectors,  while  the  cy/s  in  Kcp  (4)  are  conditional  on  x  (they  rejiresent  the  probabilities  I’{j\x)). 
Thus  mcjclel  (7)  represents  a  unconditional  |)robability,  appro])riate  for  unsupervised  learning, 
while  model  (  4)  represents  a  conditional  probability,  approjrriate  for  superviscnl  learnir.g. 

I'here  is  another  model  studied  in  statistics,  the  swiichiiuj  r((jr( .ssion  model  (Quandt  .V 
Ramsey.  1972.  1971^.  I)e  \>aux.  19S()).  that  is  intermediate  between  model  (7)  and  model  (4). 
The  switching  regression  model  is  given  as  follows 

P(i/|x)  =  XH{y\x.0i )  +  ( 1  -  X)P{y\x.B2).  (X) 

where  the  P(^|x.  0i)  are  univariate  gaussians  and  the  mean  of  each  gaussian  is  assumed  to  be 
linear  in  x.  This  model  assumes  that  the  data  pair  {^.x}  is  generated  from  a  pair  of  linear 
regression  models  through  a  random  switch  which  turns  to  one  side  with  probability  A  and  to 
the  other  side  with  probability  1  -  A.  This  model  can  be  generalized  to  allow  for  a  multinomial 
switch 

A 

7=t 

where  >  0.  =  1  and  F(y|x.0j)  is  given  by  Ecp  (1).  The  difference  between 

switching  regression  and  the  ME  model  is  that  the  switching  regression  model  assumes  that 
the  setting  of  the  switch  is  independent  of  the  input  vector.  This  assumption  does  not  allow 
for  piecewise  variation  in  the  form  of  the  regression  surface;  all  of  the  regression  components 
contribute  throughout  the  input  space.  Switching  regression  can  be  viewed  as  one  end  of  a 
continuum  in  which  the  overlap  in  the  regression  components  is  total:  decision  tree  models 
(e.g..  Breiman  et  al..  1984)  are  the  other  end  of  the  continuum  in  which  the  overlap  is  zero. 
The  ME  model  interpolates  smoothly  between  these  extremes. 

An  EM  algorithm  for  training  the  mixture  of  experts 

In  many  estimation  problems  the  likelihood  is  a  complicated  nonlinear  function  of  the  param¬ 
eters.  Parameter  estimation  in  such  cases  usually  involves  some  sort  of  numerical  optimization 
technique,  typically  gradient  ascent.  An  alternative  to  gradient  techniques,  applicable  in  many 
situations,  is  the  '•Expectation-Maximization”  or  "EM"  algorithm  (Baum,  Petrie.  Soules  A 
Weiss,  1970;  Dempster.  Laird  k  Rubin.  1977).  EM  is  based  on  the  idea  of  solving  a  succession 
of  simplified  problems  that  are  obtained  by  augmenting  the  original  observed  variables  with  a 
set  of  additional  "hidden"  variables.  Unconditional  mixture  models  are  particularly  amenable 
to  the  EM  approach  (Redner  i*.:  Walker.  1984)  and,  as  observed  by  .Iordan  and  .Jacobs  (in  press), 
tlie  conditional  mixture  of  experts  model  is  also  amenable  to  an  EM  approach. 

Given  an  observed  data  set  we  augment  >’  wdtli  a  set  of  additional  variables  called 

"missing"  or  “hidden"  variables,  and  consider  a  maximum  likelihood  problem  for  a  "complete- 
data"  set  2  =  {A’-T’mis}  (cf.  Little  Rubin.  1987).  We  choose  the  missing  variables  in  such 


4 


a  way  tliat  tli<'  it'siiltiii”,  •■coiiiplett'-data  lojs,  likf'liluHxl."  ,i>iv(Mi  !>>  1(0.3)  =  In  /'(3'.3',„,,|0). 
is  easy  to  maximize  with  respect  to  0.  I'lie  i)rol)al)ilit \  model  l‘{y.y,„,.s\0)  must  be  ( hosen 
so  that  its  marginal  distribution  across  Jt’.  referred  to  in  this  context  as  tin'  "incomph'te-data" 
likelihood,  is  the  original  likelihood 

p{y\0)  =  y  (10) 

In  deriving  an  update  to  the  paratneters  based  on  t lie  com])let e-data  log  likelihood,  we  hrst  note 
that  we  cannot  work  directly  with  the  com |)let e-data  log  likelihood,  because  this  likelihood  is 
a  random  function  of  the  missing  random  variables  Xiws-  ^  ke  idea  is  to  average  out  J',,,,,.  that 
is.  to  maximize  the  txixrlul  complete-data  log  likelihood  [In /^(  J'. 1  his  itlea 

motivates  the  EM  algorithm. 

The  E.\I  algorithm  is  an  iterative  algorithm  consisting  of  two  stei)s: 

•  The  Expectation  (E)  ste[).  which  cuinpulos  iho  followiiig  conditioiial  expectation  of  the 
log  likelihood 

g(0|0‘'>)  =  £y.„..{ln/>(2|0)|T.0'"} 

=  J  P(  y,,,...  IJ.  ) )  In  n  2\0)fiy„. (in 

where  0*^*  is  the  value  of  the  parameter  vector  at  iteration  Ax 

•  The  Maximization  (M)  step,  which  computes 

0(A+i)  ^  arginax  g(0|0<^M.  (E2) 


The  M  step  chooses  a  parameter  value  that  increases  the  Q  function:  the  expected  value  of 
the  complete-data  log  likelihood.  Dempster,  Laird  and  Rubin  (1977)  proved  that  an  iteration 
of  EM  also  increases  the  original  log  likelihood  /.  That  is. 

/(0(A  +  1).  J)  > 


Thus  the  likelihood  /  increases  monotonically  along  the  setpience  of  parameter  estimates  gen¬ 
erated  by  an  E.M  algorithm. 

Although  in  many  cases  the  solution  to  the  .\I  step  can  be  obtained  analytically,  in  other 
cases  an  iterative  inner  loop  is  recpiired  to  optimize  Q.  .Another  possibility  is  to  simitly  increase 
the  value  of  Q  during  the  .M  step 

g(0(^  +  l)|0(t))  >  Q(0(‘)j0(^))  (  l;{) 

by  some  means,  for  example  by  gradient  ascent  or  by  Newton's  method.  .An  algorithm  with 
an  M-step  given  by  Eq.  (13)  is  referred  to  as  a  (jf n( ralizf d  EM  ((.lEM)  algorithm  (Dempster. 
Laird  cV  Rubin.  1977). 

For  the  .ME  architecture  we  choo.se  the  missing  data  to  be  a  set  of  indicator  random  variables 
y.n..  =  1.---.A'.  /  =  1..--..V}  with 

j(t)  _  f  1-  if  generated  from  the  j-ih  model  given  by  Eq.  (1). 

■'  I  0.  otherwise 


o 


aiul 


=  1 ,  ft)!'  parli  /. 

\VV  assiiiiK'  that  tlio  distrihutioii  of  tho  roinploto  tlata  i<  s'lvoii  as  follows 

'  '' 

^=1  ./  =  ! 

It  is  easy  to  verify  that  this  distriltution  satisfies  K(|.  (10). 

From  Eq.  (11),  we  also  ohtain 

g(0|0‘'»)  =  Ey„„Jln/'(Z|0)|J.0<*>} 

=  ^^/t‘*'’(/)ln[</,(x*'*.0,^)/'(y‘'*(x*'>.^j] 

t=i j=i 

v  i<  ,\ 


=  ^^/)‘/'*(/)lii</j(x*'*)  +  ^/?*/’(/)lii  /'(y*'*|x<'>.0i 


^  =  1  .;=! 


+  •■•  +  t^/^^'’(/)ln/’(y''>|x<'*.0A). 


where 

hf\l)  =  £[/<'>!>'.  0<^>]  =  P(j|x‘'>.y''») 

Eiil  ff,(xF).^W)/>(y(P|xF).^|'  V 

where  P( j|x*'*.y<'*)  denotes  tlie  probability  that  the  pair  {x*'*.yO)}  comes  from  the  j‘'‘  prob¬ 
ability  model.  Note  that  wo  always  have  /i*^*(/)  >  0. 

With  the  Q  function  in  hand,  we  now  investigate  the  implementation  of  the  M  step.  From 
Eq.  (15).  Eq.  ( 1)  and  Eq.  (2).  we  have 


=  L  !:<■’(  0- 


/yjix^'Ke,, 


'a  t=i  j=\ 

S  h 


(=ij=i  ,=1 

/=i  ,,=i  ■ 


f>P(y<'>|x<^] 


/F(y''»|x'').0,; 


'J  /  =  !  ''''J 

^  a.)  ^(x(').6>,) 


=  L^'r^o- 


3:;'[y<'>-f,(x''>.0,)].  7  =  1. 


(=1 


G 


"  /  =  ! 

J  =1.--.A-. 

Hy  U'tting  _^.a  +  ii  =  0.  we  Dhtaiii  the  update  lor  tlie  covariaiK e  matrices 


,,) .  X:///*(n[y‘'’-f,(x<'».g,)][y*'»-f,(x<'>.^,)]^  (20) 

T.  i=ii>  i  (Of=i 

Assuming  that  the  training  set  y  is  generated  by  a  mixture  model.  w<>  note  tliat  when  the  tlie 
samitle  number  .V  is  sufficiently  large  (relative  to  the  rlimension  of  y).  the  space  si)anned  by 
the  A  vectors  [yO)  -  f^(  x*'*.  0, )].  /  =  1.--.A  will  be  of  full  dimension  with  probability  one. 
Recalling  that  l7\f)  >  0  we  observe  that  wlien  the  sample  number  .\  is  sufficiently  large  the 
matrices  +  therefore  positive  definite  with  probability  one. 


Next,  by  letting  _^i*  +  u  =  0.  we  obtain 


which  we  can  solve  explicitly  given  our  assumption  that  the  expert  networks  are  linear 


where 


0 


(x<'')^  0 


1  0 

0  1  0 


.Note  that  is  invertible  with  probability  one  when  the  sample  size  .V  is  sufficiently  large. 

Finally,  let  us  consider  the  update  for  d,j.  Jordan  and  .Jacobs  (in  press)  observed  that  the 
gating  network  is  a  specific  form  of  a  generalized  linear  model,  in  particular  a  niiiltinoinial  lorjH 
model  (cf.  McCullagh  k  Nelder.  19S1).  Multinotuial  logit  models  can  be  fit  efficiently  with 
a  variant  of  Newton's  method  known  as  ildritirfhj  rrwcightfd  hast  sqmirrs  (IRLS).  For  the 
purposes  of  the  current  paper,  we  simply  write  tlie  generic  form  of  a  .Newton  update  and  refer 
the  reader  to  .Iordan  and  .Jacobs  (in  press)  for  further  details  on  IRI.S.  .Note  also  that  .Jordan 
and  .Jacobs  (in  press)  assume  that  the  inner  loop  of  IRLS  fitting  runs  to  completion.  In  the 
current  paper  we  address  only  the  rase  in  wliich  a  single  IRLS  step  is  taken  in  the  inner  loop. 
The  form  of  this  IRLS  step  is  generalized  to  allow  a  learning  rate  parameter.' 


'Thus  the  algorithm  that  we  analyze  in  this  paper  is.  strictly  speaking,  a  (Jt.'.M  algorithm. 


I'lio  iiixialt'  for  llic  network  |)aram«>loi^  t>l)taiiitnl  a^  follows.  Doiiolc  tli<'  ^lad'K-nt 

vector  at  iteration  A-  as 


f=l  /=! 


and  the  Hessian  inatri.K  at  iteration  A-  as 


(  =  1  ,;=1 


Tlien  tlie  j^eneralized  IRLS  n|)date  is  jj,iven  as  follows 

0U  +  1)  ^  0(k)  (2s) 

where  -  ,y  is  a  learning;  rate. 

In  snminarv.  the  parameter  update  for  tlie  model  Krj.  (4)  is  given  as  follows 


Algorithm  1 

1.  (The  E  step):  Compute  the  by  Eq.  (10). 

2.  (The  M  step):  Compute  by  Eq.  (20).  compute  by  Ihp  (2S).  and  also 

compute  =  1.  ■  •  A‘  by  fjq.  (22). 


Before  closing  this  section,  let  us  return  to  the  switching  regression  model  (Eq.  9).  Following 
the  .same  procedure  as  above,  we  obtain  the  following  I'M  algorithm  for  switching  regression. 


Algorithm  2 


1.  (The  E  step):  Compute  the  //^*(/)'s  by 


o''>(x('')/>(y<'>|x'').<>) 

II  (t)  ~  - ^ - - - . 

EfL,o|''(xO))/>(y(O|x(O.0|*') 


2.  (The  M  step):  Compute  Dy  by  Eq.  (20).  and  let 


n</+’>(x<'>)  =  hf^l). 


Obtain  in  the  same  manner  as  for  model  Erj.  (4). 


VVe  see  that  the  EM  algorithm  for  switching  regression  is  simpler,  because  the  o*^*(x*'*)'s 
are  not  constrained  through  a  common  parameter  0,^  as  in  the  ME  model. 


Theoretical  convergence  results 

III  this  u<>  providi'  a  iiiiiiiIh'I  of  coiivoit'.ciicc  i<‘miIi.s  tor  llu'  ali^oi il  Inn  |)r(‘>(‘iii cd  in  lli4' 

|)i('\  ioiiN  M'cliiiii.  \\<'  study  Ixitli  tin'  <uiiv<'i'j;<'ii(<'  and  I  lie  (■i>ii\ <'rt>,cii(  c  rain  id  llin  alsoi  it  Inn. 
In  tlu'  .Vpix'iidix  \v<'  (>xl<'iid  llinsn  insulls  to  a  iniiiilx’r  of  r«‘lat('d  al<*(>ril  Inns. 

Wn  Ix’iiin  with  a  coiiva'i'^nnc*'  tli<’or<Mii  that  nslahlislms  a  mlal  ioiisliii)  Ix'twcnii  I  li<>  ll.\l  alno 
fit  Inn  and  s>ia(rn'ni  asix'iil. 


Theorem  1  l  oi-  Ihi  nnult  I  (/in  u  hii  Lr/.  (  j  anil  llu  Itarniiifi  iilijunlhiii  i/in  n  hi/  .  1 /(/<>/ ///nn  I. 
irt  I  Kin: 


nHr-\) 

:i 

-  0*/  *  = 

i,H\  1 

ite;e.,=e:  - 

qH  +  A 

J 

-  ---- 

r  \ '  ( *■  +  1  )  1 

r(<-[\j  1  -  r* 

1  ,  1  17 

'l  .  ■)  .  f\’  1  K-  _  \  '  *  '  •  *  •  •  o  • 

tdp 

irlii  n  1  —  In  L  is  (jinn  In/  hi, 

.  (H).  Fi/. 

i  'll  (111(1  Fif.  ( 1 ).  (111(1  "n<[A]~  ill  noils  llu 

n  rior 

obldinnl  hi/  sUk  I  iikj  IIk  rohiiini  nrlors  of  tin  iikiIiIj-  .1. 

Mon  on  r.  tinsimiiiKi  llial  tin  traiiiinii  n  I  J’  is  (/<  ik  mini  hi/  IIk  iiiijliin  iikhIi  I  of  lu/s.  ( j) 
(111(1  (!)  (111(1  (issiiiiiiiifi  llidt  IIk  iiiiiiiIk  r  .\  is  siifli(i(  iilh/  lariji .  in  li(ir(  Ihat  /'J**  is  a  jHisilin 
(li  fiiiil(  iii(itri.r.  (111(1  /’I**. /v*'^  j  —  \ .  ■  ■  ■ .  h  r//r  jxisilin  (lijiiiili  iii(iln((s  irilli  iirolHibililii  oi  i . 
SiHcificdlhj.  IIk  1/  l(il:(  IIk  folloiriiifj  niliKs: 

(ij  /*,***  =  irilli  (jinn  bi/  AV/.  (JS). 

(ii)  lor  j  =  l.  -.K.  /f  '  =  (/^‘/*)-‘  wilh  (/inn  bi/  Kij.  (JJ). 

( Hi)  I (II  j  1 .  •  ■  • .  A'. 


irli(  !■(  "  “  (l(  noil  s  III! 

Kvoiiicki  r  imxiurl  A 


y ' 

.L,i=i 


KvoikcIk  r  iirodiicl. 
n  is  (i(fiii((i  (IS 


~‘.l 


~‘.l 


(d2) 


For  (I  III  X  II  iiKili  ij-  .1  (111(1  (j  X  III  iiKilrij-  H.  till 


I  fi=  I 


(lull  (i\  2 II 

(l>\ll  <l2)ll 


fllnll  ' 

<l  2„  II 


V  >l,„\ll  •l,„2ll 


^hn  n  ) 


Proof.  F  Voin  fxis.  (A).  (  I)  and  ( I ).  (or  /  =  in  A.  wo  obtain  the  following  dorivalivos; 


()l 


^  ^  "4  ^  v~'  l\  I  1 1\  /ilA'Iirt/  4411  1 1  \ 


Oe,<H-0\r  r/,(x(''.0'/'l) 


V  /\ 


.til 


/=!  ,  =  1 


(le,, 


(Tl) 


ill 


~~  ^  _  /.  ^  77.  1.1  ..  777.  rn  _(M  ) 


Od/'i-n,  ^  ry,(x('>.0!/M/'(y<''|x<'>.0!*'’)  /'(yl')|x''l.0' 


(<•). 


?) 


.1 


{■il} 


f=l 

=  I.  -.  A. 


i>e. 


[y'"-  f,(x'''.0"'! 


lU 


^  ^  ZITl  1771  7  777  777  77T7~ ) ' 


/  =  ! 


i:'L,.V,(xf''.0!,^»)/'(y''>|x<''.0!^')  /'(y<'<|x'''.6»*/') 


=  -->11  /'!' V )(  -  [y'"  -  f,(x'''.  0'/’  )][y''>  -  f,(x''».  )]  0( 


/=! 

1.  ■.  A'. 


(;r. 


)('>)  >- 1 


i.i^) 


\\’<'  now  prove  pt>iiits  (i).  (ii)  and  (iii). 

(i)  ( 'oiiiparinsi  K(|.  {;{7{)  witli  I'xi.  ('ix)  it  follows  that  /’*/*  =  ';(/(’*/’)  '•  !<•  show  that  /') 
is  positive  (hdinite.  w('  show  that  Af/*  is  positive  definite.-  I'or  an  arhitraiv  vector  u.  from  Kip 
(2X)  we  have 


\  A 


=  -  .r|,(x<'>.0</>)]v^  v  >  0. 

t=l  ./=! 

since  ,(y,(x*'*. )[1  -  .f/;(xl'*. )]  >  0.  I-xpiality  holds  in  the  above  etpiation  only  when 
V  =  [o.s,ioe]]\x  =  0  for  any  u.  which  is  impossible,  riius  wo  have  established  that  a!/*  (and 
thus  also  (A!/*)~’)  is  positive  definite. 

(ii)  Let  L'-***.  A*^*  be  given  by  F.tp  (22).  From  Ftp  (21)  and  Frp  ( ;{-l ).  we  obtain 


,(A),a(A) 


01 


1 .  •  •  • .  A  . 


Furthermore,  it  follows  from  F(p  (22)  that 

i(A) 


^U  +  i)  ^  ’  +  )-'('] 

=  0*/*  +  (  A*/*)-‘[F'5‘'’  -  (  A*/'’)0*/’]. 


(A)^_v.(A)  Q[k) 


(;i()) 


That  is.  we  have 


and  =  (  A'/'*)-‘. 


o(A  +  l)  _  oO  )  I  I  ofA)^— I  I  ■  _  I  r- 

J 


A6> 


We  now  prove  that  A***  is  |)ositive  definite.  For  an  arbitrary  vector  u.  from  Fcj.  (22  )  we 


hav(' 


u'  a'/»u  -  ^/.‘/•>(/)u^V,(^‘;’)-'.V/u  =  ^//<;>(/)v''(v<;>)-.v  >  0. 


/=! 


/=! 


A  matrix  .1  is  positi%'e  definite  if  and  only  if  .1  ’  is  positive  dr  linite. 


in 


From  tlie  iioto  imiiK'diatoly  I'ollowiiig  F(|.  (  Hi)),  wo  know  tliat  given  In  K(|.  ('JO)  is  positive 
definite  and  invertible  fttr  eacli  A'  with  i)rol)ability  one.  1  lins.  v  itli  i)rol>al)ility  one.  the  ecpiality 
of  the  alnne  eipiation  holds  only  when  v  =  A’/u  =  0  for  any  u.  which  is  impossible.  I  bus  \v(' 
have  established  that  (and  thus  also  (/A*^*)~‘)  is  positive  definite  with  probability  otie. 

(iii)  We  consider  Ktj.  (20)  for  updating  I'liis  etpiation  can  be  e.xpaiided  as  follows 


v(^  +  i) 

J 


\'( 


v(^-) 


+ 


+ 


1 


V  (A)  -  f,(x<''.«,)][y>'>  -  f,(x<''.0,)]''  -  a'/* 

(0  f=i 


•>vHl 


-IV 


E, It ' 


(•H 


where 


-  [y*'*  -  f;(x<'>.0‘;')][y''>  -  f,(x'''.0';' )]■'■)(  A'l 


t  =  l 


It  follows  from  E<i.  (do)  that 


'I'hat  is.  we  have 


IV  = 


0/ 


23: 


(A) 


(J/ 


V  ( ^ )  _  ""  ./ 


vH'l 

J 


I’tilizing  the  identity  r(([.\BC'\  =  (C^  :  .-l)ivc[.B],  we  obtain 

r^(A  +  l),  2  ,„u->  „(At,  0! 


rcc 


-C 


E;lt/'r(o 


HA)  Moreover,  for  an  arbitrarv  matrix  f  .  we  have 


■T, 


=  tr((3:'/’f')^(3:*/'*f  ))  =  ]  >  0 

where  the  eciuality  holds  only  when  _  q  which  is  impossible  with  probability  one  since 

[  is  arbitrary,  and  3]*^*  is.  as  indicated  above,  positive  definite  with  probability  one.  Thtis  we 

have  established  that  is  positive  definite  with  probability  one.  □ 

Theorem  1  can  be  used  to  establish  a  relationshi])  between  the  stej)  taken  by  the  F:M 
algorithm  and  the  direction  of  steepest  ascent.  Recall  that  for  a  ])ositive  matrix  P.  we  have 
>  0.  This  implies  the  following  corollary. 


Corollary  1  Assudu  that  IIk  trahiiity  sf t  {y^'Kx^'Kt  =  1.---..V}  ronxs  front  tlx  niijitiir 
niodfl  of  Fxjs.  (4J  (xxl  (J)  otxl  that  \  is  siifficif  nfltj  hmjt .  11/7//  probobililij  otx .  tlx  sforch 
dirfclion  of  tlx  EM  alcjorithni  has  a  posit irt  projcrtion  on  thf  (/radiftit  ascrnt  srairliitxj  dirt rt ion 
of  I  =  hiL. 


If 


riiat  ih.  tilt'  EM  alsoritlini  can  In*  vit'Wt'd  as  a  iiuKlilit'tl  f>,ra(lit'nl  ascent  alsoritlini  lor 
niaxiinizing  /  =  In  E.  Eroin  riieorem  1.  li  clianges  with  the  iteration  step  A.  thus,  the  KM 
algorithm  can  also  he  regarded  as  a  variable  metric  gratlient  ascent  algorithm.  I  his  algorithm 
searches  in  an  uphill  direction,  so  if  the  learning  rale  is  appropriate,  the  searching  process  will 
converge  to  a  local  maximum  or  a  saddle  point  of  the  likelihood  /  =  In  /,. 

Similar  results  have  been  obtained  for  nnsupervi.sed  mixture  models  by  Xu  and  .Iordan 
{ 199;})  and  for  Hidden  Markov  .Models  by  Baum  ami  Sell  ( 19()K).  See  \n  and  .Iordan  ( 199.1)  for 
further  discussion  of  the  relationshiiis  between  these  theorems. 

We  now  utilize  a  result  from  Xu  and  Jordan  (199:})  to  establish  the  convergence  of  the 
parameters  0*^*.  We  also  provide  convergence  rates  for  both  /(0*^M  and  0*^*. 

Theorem  2  Assunn  that  tiu  tmininy  s<  t  is  gfiKiatfd  bij  thf  iiiii  tun  iiUHld  of  h'qs.  (4)(1) 
and  that  .V  is  sujfirit  ntly  laigt.  Assuiik  fiinht  r  that  (H'f  iliacjonal  and  Ut  vv^  bt  a 

rector  consist iny  of  the  diayonal  elements  o/X,. 

Let  as  delude 

Furthermore .  (mgunit  that  on  a  yiven  domain  Dq 

(0  •  J  =  l.-'-.A'.  /  =  1.  •••.»/  exist  and  are  cerntinuous; 

(ii)  the  Hessian  nieitrix  H(©)  is  negative  definite: 

(Hi)  0*  is  a  local  maxinnim  of  1(0).  and  ©“  G  Dq. 

Then  with  probability  one. 

(1)  Letting  -M.-m  (  here  M  >  m  >  0)  be  the  minimum  and  maximum  eigenvalues 
of  the  negative  elefinite  matrix  (  P-’ /A (0)(  P?  )  (or  egirivedently  the  minimum  and  maximum 
eigenveilues  ejf  PH(0).  since  we  have  PHe  =  Ae  from  {Pi  H  P^e  =  Xe).  we  have 

/(0‘)-/(0<^>)<  r^[/(  ©*)-/(  00 )].  (38) 

||F-1(0(')  -  0*)j|  <  l/f  /y^[/(0M-/(®o)].  (39) 

xvhere  r  =  1  -  ( 1  -  '^)^  <1-  HV  also  have  0  <  |r|  <  1  when  M  <  2. 

(d)  For  any  initieil  imint  0o  G  Dq-  linu- _ x  =  0'  when  M  <  2. 

Proof.  As  indicated  earlier,  when  the  training  set  {y*'*.x***,t  =  1.---.A'}  is  generated 
from  the  mixture  model  of  Eqs.  (4)(  1 )  and  .V  is  sufficiently  large.  X**'*  remains  positive  definite 
during  the  learning  process.  Thus,  under  the  condition  (i).  it  follows  from  Eqs,  (6).  (4)  and 
(1)  that  H{0)  exists  and  remains  continuous  on  D^.  Expanding  the  log  likelihood  in  a  Taylor 
expansion,  we  have 

t()1{0)  1  X 

1(0)  -  /(0')  =  (0  -  0*)^-^|0^0.  +  -(0  -  0-)TH(0-  +  ^(0  -  0M)(6>  -  0M 
with  0  <  <  1.  .Since  l0=0*  — 

/(0)-/(0*)=  ^-(0-0*)^//(0*+^(0-0'))(0-0-).  (40) 


12 


From  riieorom  1  wo  know  tliat  P  is  positive  detinite.  I'lirt hormoro.  from  condition  (ii). 
//(0)  is  nej»ativo  definite  on  Dq.  This  implies  tliat  p'l  exists  and  (  H(0){  )  is  negative 

definite  on  Dq.  I’tilizing  the  Rayleigh  ipiotient  we  obtain  that  for  any  u. 

-  .l/||ul|"  <  II(0)(P^  )u  <  -///l|u||‘.  (-11  ) 


Substituting  Fij.  (11)  into  Fii.  (10).  we  obtain 
1 


i{0)  -i(0')  =  -^{0  -  0' ( p-h' { n (0'  +  ^(0  -  0’))P^p~n0  -  0' 


M . 


.Moreover,  we  have 


-//) 


Thus 


/(0)-/(0‘)  >  -  — 1|/'->(0-0 


■^(0-0*)||^  >  l(®-®*)’[^-^i0=0*]l 

>  -i|Pi^|iliP-M0-0-)|| 


|lp-i(0-0*)||<^||H^i 


Together  with  Eq.  (10).  we  obtain 


-||ci^ll'<  ^|/(©l -/(«■)]. 


c)0 


.1/ 


(42) 

(40) 


(44) 


On  the  other  hand,  we  also  have 


/(0)-/(0<^')  =  ^o,  +  i(0-0<^')^'//(0<^>  +  C'(0-0'^>))(©-0<^'>) 

()0  2 

with  0  <  ^'  <  1.  By  Theorem  1.  we  know  that  for  the  EM  algorithm.  0f^+ri  _  0(*^)  — 
P result  in  the  above  equation,  w'e  obtain 


/(0('‘  +  l))  _  /(0(A))  ^ 


+ 

> 


i(4/(0),.2 

00  ''0=0“' 
1  ,  „L()l{0) 


lOI(0), 


(1-f 


01(0) 

00  "©=0“’ 


00  '0=0“'^ 

(40) 


Combining  Eq.  (45)  and  Ecj.  (44).  we  obtain 


/(0(A  +  l))  _  /(0(A))  >  -(1  _  :^)^[/(0(A))  _  /(0»)] 

/(0(A+i))_/(0.)  >  [1  _(i_:^)^][/(0(A))_/(0*)] 

>  [1-(1  -  y)^]1/(©u)-/(0*)]. 


(40) 
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and  furthermore 


Let  r  =  [1  -  ( 1  -  Multiplying  l)otli  sides  of  tlie  al)ov(>  etjuation  In  negative  one. 

we  obtain  Etj.  (dN).  In  addition,  it  is  t'asy  to  verify  that  0  <  |/i  <  1  when  M  <  2  (rp<all  that 
M  >  III).  Flirt hennore.  it  follows  from  Fq.  (  II)  and  Ftp  (42)  that  we  have 

/(©<^>)  - 1(&')  <  -  0')ir- 

which.  by  Fq.  (db).  betomes 

-  0')||-  >  /•*■[/( 0u)  -  /(0M] 

||p-7(0(^)  -  0‘t||  <  I [/(©•)-/( 00 )] 

which  is  just  Fq.  (dO).  In  addition,  when  .\!  <  2.  jrl  <  1.  we  have  liiin_x  0**^  =  0’  since  F 
is  positive  definite.  □ 

We  see  from  this  theorem  that  the  FM  algorithm  converges  linearly.  Moreover,  the  speed  of 
convergence  depends  on  the  difference  between  .M  and  in:  the  smaller  the  difference,  the  faster 
the  convergence. 


Theoretical  analysis  of  an  EM  algorithm  for  the  hierarchical 
mixture  of  experts  architecture 

An  EM  algorithm  for  training  the  hierarchical  architecture 

The  ME  architecture  can  be  viewed  as  an  architecture  for  splitting  the  input  space  into  regions 
in  which  different  local  functions  are  fit.  The  hierarchical  mixture  of  experts  (HME)  architecture 
generalizes  this  idea  to  a  nested  model  in  which  regions  in  the  input  space  are  split  recursively 
into  subregions  (Jordan  k  Jacobs.  1992).  The  resulting  tree-structured  architecture  can  be 
viewed  as  a  multi-resolution  function  approximator  in  w’hich  smoothed  piecewise  functions  are 
fit  at  a  variety  of  levels  of  resolution. 

As  shown  in  Figure  2.  the  HME  architecture  is  a  tree.  In  this  tree,  each  terminal  node  is  an 
expert  network,  and  each  nonterminal  node  is  a  root  of  a  subtree  which  itself  corresponds  to 
an  HME  architecture.  At  every  nonterminal  node  in  the  tree  there  is  a  gating  network  which  is 
responsible  for  the  topmost  split  of  the  HME  architecture  rooted  at  that  node.  .\11  of  the  expert 
networks  and  the  gating  networks  in  the  architecture  have  the  same  input  vector  x  G  ■  In 
the  remainder  of  this  section,  as  in  Jordan  and  Jacobs  (in  press),  we  consider  the  case  in  which 
the  expert  networks  and  the  gating  networks  are  generalized  linear  models.  Furthermore,  for 
simplicity,  we  consider  only  the  case  in  which  the  probability  model  for  the  experts  is  gaussian. 
Let  us  denote  a  node  at  depth  r  by  c,o, This  node  is  the  /,.-th  daughter  of  the  node 
•  The  root  node  of  the  tree  is  The  number  of  branches  emitted  from  r,o, is 
denoted  by  .  For  simplicity,  we  can  omit  /q  and  write  I'i^...,,  and  .  In  addition, 

the  output  of  the  subtree  rooted  at  is  denoted 

A,,  ,, 

yil-'-h  —  Xw  ^‘1  1^-  ^(1  ••  i,  ■Alt  +  l 

/r+I  =1 

where  </,, gnting  coefficient  generated  by  the  gating  network  attached  at  c, . 
This  coefficient  satisfies 

A  ,j  ,, 

=  1. 

'r  +  l  =1 
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Figure  2:  A  two-level  hierarchical  mixture  of  experts.  To  form  a  deeper  tree,  each  expert  is 
expanded  recursively  into  a  gating  network  and  a  set  of  sub-experts. 

for  any  x.  where  is  the  parameter  vector  of  the  gating  network. 

Given  a  training  set  T  =  /  =  1.---.A'}.  we  want  to  ma.ximize  the  likelihood 

function  (cf.  Eq.  6).  that  is. 

L  =  P({y<*)}f'|{x<'>}f  )  =  JJ  F(y<V'’)- 

/  =  ! 

Expanding  the  probability  model,  we  have 


where 


P(y<^>|x'').r„)  = 


if  is  a  nonterminal  node. 

(27rdet 

if  r,,  is  a  terminal  node 


IT) 


and  ipcursivt'lv 


,,1,  ,,v  ,  if  n,, ....  is  a  iiontorminal  node. 

if  f'lj...,,  is  a  tei'iniiial  node 

(l.S) 

wliero  f,, is  paranietprized  fimctioii  iiupknnented  by  the  oxix'it  network  at  terminal  node 
r,,...,,  .  and  0,,...,,  .  iii't'  the  parameters  of  this  expert  network:  r,, )  is  the 

probability  that  y*'*  is  generated  from  the  probability  model  rooted  at  when  x*'*  is  the 

input . 

To  derive  an  EM  algorithm  for  the  HMfi  architecture,  we  attach  a  set  of  indicator  random 
variables  to  each  nonterminal  node  c, 

^(0  _  f  1'  if  y***  i^  generated  by  the  subtree  rooted  at  r,, . 

1  0.  otherwise. 

The  missing  data  T’mis  consists  of  all  of  the  indicator  variables  attached  to  the  nonterminal 
nodes  throughout  the  tree.  In  addition,  we  denote  by  sp*  consisting  of  the  indicator 

variables  attached  to  the  nonterminal  nodes  in  the  subtree  rooted  at  r,, ...,y. 

We  define  the  distribution  of  the  complete  data  2  =  {.T.  T’mis}  as  follows 

■V  ^'•0 


P{2\0)  =  n  n  [»..  (x<''.0fjP(2,Jx»''.c„)]'M 


(=1 11=1 


where 


F(2„|x<'>.r,J  = 


if  e,,  is  a  nonterminal  node, 
if  r,,  is  a  terminal  node 


and  recursivelv 


P(2„...,v|x<'>.  r„...,v)  = 


if  e,, is  a  nonterminal  node, 
if  c,, is  a  terminal  node 


It  is  not  difficult  to  verify  that  this  distribution  satisfies  Eq.  (10)  as  required. 

We  now  compute  the  Q  function  as  required  by  the  E  step  of  the  EM  algorithm.  From  Eq. 
(11).  we  obtain 

Q(0|0''))  =  i:  /i!^*(/){ln[.9„(x(').^^  )]  +  In  } 

/  =  !  /1=1 

where 

/?!>■>(/)  =  £’[/<'’!>'. 0<^>]  =  — .  (.52) 


l(i 


1„  y  ^  I  Ef,  =  i  )]  +  111  /;,,,}.  if  (•,,  is  a  noiiteniiinal  iio<l<>. 

111 /•,! ).  if 1  is  a  terminal  node 


(0 


•  ^  / 1  /  ■  ) 


and  recursivelv 


if  (',11,  is  tt  iiontenninal  node. 
In  P(y<')lx(').  (•„...„  ). 

if  I',,...,,  is  a  terminal  node 


(od) 


(0 

'l-'-'rO  +  l'  ' 


■VII 


where  ^  is  the  estimate  of  ^  at  iteration  k.  P(y*'’|x*'*.  )  is  given  by  Eqs.  (47)  and 

(4S)  and  P*^'*(y***|x*'^  t’/,  ..,, )  means  that  the  probability  is  determined  with  all  the  parameters 


in  the  subtree  rooted  at  being  fixed  at  the  estimates  obtained  in  iteration  k. 

Proceeding  now  to  the  M  step  of  the  EM  algorithm,  we  obtain  parameter  updates  by 
optimizing  the  Q  function.  If  the  node  r,, is  a  terminal  node,  by  setting  the  partial  derivative 
of  Q  with  respect  to  equal  to  zero,  we  obtain  an  update  for  the  covariance  matrices 


^(.M)  _ 


To  obtain  an  update  for  the  parameters  of  the  expert  networks  we  differentiate  Q  with  respect 
to  and  find  that  we  must  solve  the  following  equation 

*']• 


Of  T  (yIO  fi  1 


(=1 


In  the  case  of  linear  expert  networks,  this  equation  is  a  least  squares  equation,  which 

can  be  solved  as  follows 


^(<■•+1)  _/ /?(<■-)  1-1 J'') 


where 


and 


.V 


(!>■) 


t=l 

v 


f  vl/’l  . 


(57) 


(=1 

Finally,  for  any  nonterminal  node  c,,...,,.  setting  the  partial  derivative  of  Q  with  respect  to 
^quai  to  zero,  we  have  that  is  the  solution  of  the  following  nonlinear  system 


1 1  •••ir 

.V 


(=1 


d  +  l  =1 


do^  , 
/]  •••/,- 
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As  in  tlu>  case  of  the  ono-levol  MIC  arcliiteci lire  (cf.  (2())  and  Kq.  (2>')).  vve  obtain  tlie 

following  Newton  stej)  for  updating  the  gating  network  parameters 


1 1  ■  I,  ~  "(I  ■■■I, 


+  hi  -I,  ( 


'd*') 


-1  7(A) 


(■W) 


where 


and 


/o,  „ 


1=1 


<  > + 1 — 1 

X  fl  -  «  lx*"  -  <.  +  1  ■ -^>1 


Ei'lhV'nlw-'-i'hj' 


t=\ 

X  E  ( 0  -  «/„ ( x<". )]-^  ,, 

0+1=1 


(1)  /i7(A)  ,1  I  ■•■',  + I 


(bO) 


(bl) 


As  in  the  case  of  the  one-level  architecture  (cf.  Eq.  (28))  (he  algorithm  in  FCq.  (.59)  is 
essentially  the  same  as  the  IRLS  algorithm  suggested  by  Jordan  am'  Jacobs  (in  pre.ss).  although 
we  have  introduced  a  learning  rate  parameter  and  we  restrict  the  update  to  a  single  step. 

In  summary,  the  EM  algorithm  for  the  HME  architecture  is  given  as  follows. 


Algorithm  3 

1.  (The  E  step);  Compute  the  ■  ■  • .  )  by  Eqs.  (.52).  (.53)  and  (.54). 

2.  (The  M  step):  Compute  the  by  Eq.  (.5.5).  compute  the  by  Eq.  (.59).  and 

compute  the  by  Eq.  (.57). 


VVe  can  think  of  the  E  step  as  assigning  credit  (posterior  probability)  to  various  branches  of  the 
tree  for  each  data  point  and  the  M  step  as  solving  weighted  least  squares  problems  in  which 
the  weights  are  given  by  the  posterior  probabilities  assigned  in  the  E  step.  The  updates  for  the 
gating  networks  simply  cache  away  the  posteriors. 


Theoretical  convergence  results 

Much  of  the  analysis  developed  in  Section  2.2  can  be  extended  to  cover  the  EM  algorithm  for 
the  HME  architecture.  In  this  subsection  we  extend  Theorem  1  and  Theorem  2  to  cover  the 
hierarchical  case.  The  results  are  given  as  Theorems  3  and  4.  respectively. 

We  first  compute  the  derivatives  of  the  Q  function 


OQ 


dB^ 


\ 

Z 

f=i 


IS 


X 


((>2) 


',  +  1  =  1  ''m-i 


(>Q 

oe,,...,, 


Df  ^  ^ 

. .ft,.: . "2...,(y"'-r, . . ,1. 


5#-  =  -ii:*!:'i'i/,!;;',('i--/.!:'„(')i;-'.„x 

'I-"  ^  /=! 

X  -  [y'"  - f„  ..,(x'''.e„,.,. Iiiyi'i  -  f„  ((ni 

and  tlio  dt'riva lives  of  t lie  log  likelihood 

()l  - - -  /  J.l  ,  V 


O'L,,..., 


dffl  ■ 

1 1  ■■■', 


',  +  1=1  ‘''‘All-', 


l  I  ,,.r  > 

,  =<‘>0’'  •  (<>•>) 
'1  "  ''m  - 


X  . ..  »i. 


V*  _ (  ^'  ) 


=  -n  E  ■  ■  •''1*'.,,  (Oisl'l,,  r‘  X 


Based  on  these  two  sets  of  derivatives,  we  follow  the  same  line  of  thought  as  in  the  proof  of 
Theorem  1  to  establish  the  following  theorem  for  the  HME  architecture. 

Theorem  3  For  tlu  IIME  archil  frlitn  of  Eqs.  (/,7)  and  (fSJ.  irilh  thf  jHirarnftcr  iifHlatfs  (jicfii 
by  Algorithm  3.  irt  hare  that  for  f  irry  ikhIi 


>)i_  ...x,rv(^-)  • 


- '’-k:'.,  j  =  ft, 


puW 

(H 

'  i\-i, 

do^;  , 

p(^') 

01  ’ 

O0„...„ 

p(^) 

01 

* 

-^ij 

"  fh’cc[E,,, 

when  I  =  In  L  is  dr  fined  by  Eqs.  (6).  (f7)  and  (fS). 
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Monovfv.  ussunmtc]  that  tin  traiiiiiiij  I  y  is  i/tiifratnl  bi/  tin  H.Mh  iinuh  I  of  ii/s.  f  j// 
<tii<l  (-{''i).  and  that  tin  nitnibir  .\  is  suffirit  ntly  laifn.  linn  i''  a  iMjsilirt  (h  finilt  inatiiu. 


<"^<1  ftlr 

art  po.' 

■^itive 

dejinite  matrices  with  piobability  one .  .S 

jui  ijii  ally,  they  lake 

the  following 

valu 

t  s: 

(i) 

—  1 

with 

(<o  if!.,. 

,  =  ( 

.-1 

with 

ifH) 

;  given 

by  Fep  (2,7). 

(Hi)  For 

,  we  hat 

' » 

V 

«t 

-1  tl 

2 

\iKl 

V  (  ^' )  v  (  ^' ) 

. 

(<>9) 

irhfVi  “  dmotfs  tin  Kromcktr  prod  act  as  dffiind  in  Tin  on  in  i. 


From  Theorem  5.  we  can  again  reach  Corollary  1.  Again,  we  see  that  the  FM  algorithm 
for  training  the  HME  architecture  is  a  type  of  variable  metric  gradient  ascent  algorithm  for 
maximizing  /  =  In  T. 

Finally,  we  can  also  generalize  Theorem  2  as  follows. 

Theorem  4  Asuunn  that  tin  training  stt  y  is  yunratfd  by  tin  HME  nnxli  1  of  E.qs.  (^7)  and 
(4fl)  (Hid  that  tin  numbfr  .V  is  sufficn  ntly  laiyt .  Assnnn  that  is  diaijonal  and 

i.s  a  Victor  consisting  of  th(  diagonal  (U  nnnts  o/H, . 

L(t&  bi  a  Victor  prod  uc(d  by  cascading  ivery  actor  of 

ivcry  nodi  V)^...i^  in  tin  HME  arcliiticlun.  Let  P  b(  a  diagonal  block  matrix  ivith  iach  diagonal 
Him  biing  a  positii't  dUnjoind  block  matrix  i,  ]•  items  of 

&  and  P  art  arranged  in  such  a  way  that  in  P  cornsifonds  to  v,, in  0. 

Furthcrnion .  assume  that  on  a  given  domain  D& 

(i)  Thi  parameterized  functions  of  all  the  exiHit  networks  and  tin  gating  networks  have 
second  order  continuous  derivatit'es. 

(ii)  The  Hessian  matrix  H(0)  =  negative  definite: 

(Hi)  0*  is  a  local  maximum  of  1(0).  and  0*  G  Dq. 

Then  we  have  the  same  conclusion  as  ghwn  in  Theore  m  2.  That  is 

(1)  Letting  —M.—m  (  here  M  >  m  >  0)  be  the  minimum  and  maximum  e igtnvalues  of 
the  negatively  definite  matrix  (P^)‘  H(0)(P^)  (or  eejuivalt ntly  the  minimum  and  maximum 

1  7’  1 

eigt mmhies  of  PH(0).  since  we  have  PHe  =  Ae  from  (P'i)'  HP^e  =  Aei.  we  have 

1(0')  -  /(0<*>)  <  /•[/(0*)  -  /(0o)]. 

||p-7(0<^)  -0*)||  <  |/f/'y|[/(0*)-/(0o)]. 

where  r  =  1  —  ( 1  —  ^  )^  <  1.  li  t  also  have  0  <  |r|  <  1  when  M  <  2. 

(2)  For  any  initial  point  0q  £  F)q.  linu—„:<  0^**  =  0*  when  M  <  2. 

VVe  should  point  out  that  the  similarity  in  the  conclusions  of  Theorem  2  and  Theorem  4  does 
not  mean  that  the  EM  algorithm  for  the  hierarchical  architecture  has  the  same  convergence 
rate  as  that  for  the  one-level  architecture.  In  the  two  cases  the  matrix  P  is  different,  and  thus 
M.  m.r  are  also  different.  This  results  in  different  convergence  rates.  Indeed,  in  practice,  the 
hierarchical  architecture  is  usually  faster.  The  similarity  in  the  conclusions  of  the  two  theorems 
does  mean,  however,  that  the  convergence  rates  for  both  the  algorithms  are  of  the  same  (linear) 
order. 
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Variants  of  the  EM  Algorithm  and  Simulations 

Variants  of  the  EM  algorithm 

For  coiivoiiioiico.  wo  doiioto  an  F.M  iipdato  of  llio  |)araiiiot(M-  voctor  as  follows 


CO) 


From  1  heorems  1  and  2  ainl  riiooronis  d  aiul  I.  wo  soo  tliat  tliis  n|)dato  is  aotnally  a  lino  ^('arcli 
motliod  along  an  ascent  direction  of  /  =  In  I. 


CD 


with  being  a  positive  definite  matrix  evaluated  at  Moreover  this  update  has  a  linear 

convergence  rate.  This  link  between  the  F.M  algorithm  and  conventional  gratiient- based  opti¬ 
mization  techniques  suggests  the  possibility  of  using  acceleration  techniriues  for  improving  the 
convergence.  In  the  sequel  we  suggest  two  such  acceleration  terhni<|ues. 


Modified  line  search 

Eq.  (71)  can  be  replaced  by  a  modified  line  search 

Q(k+i)  ^ 

=  (72) 

where  Aj^.  is  a  stepsize  which  is  optimized  by  maximizing  /(0**’*  +  Ai-d^  )  with  respect  to  A/,  via 
a  one-dimensional  search  (e.g..  Fibonacci  search). 

The  implementation  of  a  one-dimensional  optimization  method  at  every  parameter  update 
is  typically  expensive.  One  often  uses  an  inaccurate  line  search  by  decreasing  (e.g..  A^  —  rA^. ) 
or  increasing  (e.g..  A^.  —  ^A;. )  the  stepsize  heuristically  according  to  a  stopping  rule.  One 
frequently  used  stopping  rule  is  the  .so-called  Goldstein  test  (Luenberger.  19H4).  The  Goldstein 
test  is  implemented  as  follows 


/(AD  < 
/(Aa.)  > 

I'm  = 


/(0)-b5/'(0)AA.. 
/(O)-h(l  -f)/'(0)AA.. 

d[^u  =  »<'•). 

A 


where  0  <  5  <  1  is  a  specified  error  bound. 
Interestingly,  if  we  rewrite  the  update  as 


(73) 


0(A  +  t)  ^  ^(A)  ^  =  (1  -  Aa.)0*^>  +  AaT;,(0‘^’)- 


we  find  that  Eq.  (72)  is  identical  to  the  speedup  technique  for  the  EM  algorithm  studied 
by  Peters  i:  Walker  (197f^a.b).  Meilijson(  1989)  and  Redner  A'  Walker  (198-1).  These  authors 
reported  a  significant  speedup  for  an  appropriately  selected  Aa-  even  without  the  Goldstein  test. 
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A  speeding  up  formula  based  on  locally  linearization 

I'siiig  a  first-order  d’aylor  ex()aiisioii  of  t  around  0**  we  have.  a|)|)roxiniately. 

Q(ui}  ^  m0(k)  _  ^(A-i)j 


or 

A6»a  = (71) 

where  B  =  and  A^a.  =  -  d'*’. 

For  the  matrix  we  have  the  following  characteristic  e(|nation 


de\{  XI  ~  B )  —  X”  \  X"  *  -l-  •  ■  •  +  I A  -)-  //„  —  A  ‘  -f-  ^  ~  ^ij  X  ■'  —  0 

/=i 

/'./=(-!)'  X!!  •  • -A,^  (j  =  1.2.---.n) 

where  A,.  /  =  1,  are  the  eigenvalues  of  o.  It  follows  from  the  ( 'ayley-Hamilton  theorem 

t  hat 

>1 

Z?"  +  =  0 

Multiplying  by  ^dk-n  (^'  >  '0-  we  have 


B'>A0^,.-n  +  ^//,5"-'A0x_„  =  0. 

j=i 

From  Eq.  (74).  we  obtain 

A0A-,  =  flA6>A-,-i  =  •  •  •  =  £f”-^A6>A_„.  j  =  0. 1.  •  •  •.  «. 


lO 


Substituting  into  Eq.  (75).  we  have 

ft 

Ae,  +  Y.I‘j^^k-j  =0.  (7(i) 

./=1 

.-Vssuming  that  in  comparison  with  the  first  /  eigenvalues,  the  remaining  eigenvalues  can  be 
neglected,  we  have  approximately 

/  I 

^"  +  I]  H  f’j  ^ 

j=i  ./=) 

and  correspondinglv 

/ 

0,  -|-  y  ]  f^i j^0,-j  =  0.  /  =  A'.  A’  -)-  1 .  •  •  • .  ( <  < ) 

J=1 

The  approximation  becomes  exact  when  the  last  u  —  I  eigenvalues  are  zero. 

By  minimizing  ||0a  +  ./ 11^-  we  obtain  the  following  linear  equation  for  solving 

S  fJ,  =  So-  S  =  [s,j]/x/-  So  =  [  — •''Ol-  — •'*02-  •  •  •  • 
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Mc)i(H)V('I'.  \v('  liavc 


^  /  /  X  / 

^  A0,  +  ^  ^  //,A0,_,. 

I  =  k  J=U  .1=^  i=k-\-{  ,/=I  ;=/  +! 

wliorc'  //,,  =  1.  From  Fq.  (77)  and 

Y,  /'.,A0,_,  =  -A0,+  ,  +0- 

I 'zzk'-^  1 

(wlipro  0‘  =  liiUA_x  A^a  ).  wo  tiavo 


I 

—  ■^0k+\  +  0  +  /'j(  —~^0k  +  ]-i  0  )  =  d. 

./=! 


Fsing  tliis  o(iuation  togetlier  with 

I  I  l-\  l 

Yfi.A^k+\-.,  =  Afl^.+i  Yt'j  -  H  /'jA^t-, 

,/=0  ,/=•)  i=0  ./  =  I+1 


wo  finally  obtain 


^I-+i 


=  0 


*■+1 


E./=o  I'. I 


This  formula  can  bo  used  for  spooding;  up  tlio  KM  algorithm  in  two  ways. 


(79) 


•  Civen  an  initial  0o-  wo  compute  via  the  EM  update  0o-0\-  ■  ■  -.01+1.  and  then  from  0j.j  = 
1. ••••/.  we  solve  //i. •••.///  utilizing  Eep  (7>f).  This  yields  a  new  0'^,  via  E(|.  (79).  Wo 
then  let  0o  =  0J_^^  and  repeat  the  cycle. 

•  Instead  of  starting  a  new  cycle  after  obtaining  01j^\~  we  simply  let  0ij^i  —  0J^].  and  use 

the  EM  update  (Eq.  70)  to  obtain  a  new  0/+>.  then  we  uso  0|.0i.  •  •  -  .01+2  ^  ^J+2- 

Similarly,  after  getting  01-.  >  /.  we  let  0f;  =  01  and  use  the  EM  u])dato  to  goi  a  now 

0k-+  \  ■  and  then  use  0k—i-  ■  •  •  ■  0k+i  to  get  a  0l^i  ■ 

Specifically,  w  hen  /  =  1.  we  have 


^A  +  l 


=  e,+ 


^0k 
1  +  /<!  ’ 


/'I 


(AgA)’^(AgA_l) 

(A6>A._,)t(A0A_,)’ 


(SO) 


In  this  case,  the  extra  computation  ro<|uired  by  the  acceleration  tochni(|ue  is  (piite  small,  and 
wo  rocommend  the  use  of  this  approach  in  practice. 


Simulations 

Wo  conducted  two  sets  of  computer  simulations  to  compare  the  performance  of  the  EM  al¬ 
gorithm  with  the  tw'o  variants  described  in  the  previous  .section.  The  training  data  for 
each  simulation  consisted  of  1000  data  points  generated  from  the  piecewise  linear  function 
y  =  m  r  -f  (i2  +  »(.  G  [•rf,..r(  ]  and  y  =  a\.r  +  +  Uf  ■>'  G  where  iif  is  a  gaussian 


raiulDiu  v;uial)lf  with  /t'rt)  iiicaii  and  variamc  n  —  ().;{.  IVaininfi  data  wei**  sampled  lr<im  the 
first  lunctidii  witli  |)i'()l)al)ilit y  0.1  and  from  the  second  Innction  witli  prol)al)ilit \  O.O. 

W'e  stndii'd  a  inodidar  architeci urt'  witli  A  =  '1  expert  networks.  1  In'  ('xperts  were  liin'ar; 
that  is.  were  liin-ar  i'unctions  [.r.  For  the  natins;  net.  we  have 

.S,  =  [.r. 


wliere  (j I  is  sjven  hy  I'l].  12).  For  simplicity,  we  updated  O.n  by  f>,t;<‘lient  ascent 

i(  ! ) 


e\ 


(M) 


riu'  learninj*,  rate  |)arameter  was  set  to  c,^  =  O.Oo  for  the  first  data  set  and  /  =  0.002  for  the 

second  data  .set.  We  used  E(|.  (20)  and  Kq.  (22)  to  update  the  parameters  j  =  1.2  and 

^(k+\)  j  _  j  respect ivelw 

The  initial  values  of  0*"*.  j  =  1.2  were  picked  randomly,  'lb  comirare  the  irerlor- 
mance  of  the  algorithms,  we  let  each  algoritlim  start  from  the  same  set  of  initial  values. 

The  first  data  set  (see  Figure  d(a))  was  generated  using  the  following  parameter  values 


nl  =  O.N.  (,2  =  O.f.  .1-1,  =  -1.0.  .11  =  1.0.  «;  =  -1.0.  =  d.O.  =  2.0.  ./{•  =  -1.0. 


'File  performance  of  the  original  algorithm,  the  modified  line  searcli  variant  with  =  1.1.  the 
modified  line  search  variant  with  A;  =  0..">.  and  the  algorithm  based  on  local  linearizafitJH  are 
shown  in  Figures  1.  •').  (i.  and  7.  respectively.  .\s  seen  in  Figures  -1(a)  and  5(a).  the  log  likelihood 
converged  after  19  ste|)s  using  both  the  original  algorithm  and  the  modified  line  search  variant 
with  Aa-  =  1. 1.  VVhen  a  smaller  value  was  used  (Aa  =  0.5).  the  algorithm  converged  after  2  1  steps 
t  Figure  0(a)).  Trying  other  values  of  A^.  we  verified  that  Aa  <  1  slows  down  the  convergence, 
while  Aa  >  1  may  .sj)eed  up  the  convergence  (cf.  Redner  iV  Walker.  lOfsl).  We  found,  however, 
that  the  outcome  was  (piite  sensitive  to  the  selection  of  the  value  of  Aa.  For  e.xample.  setting 
Aa  =  1.2  led  the  algorithm  to  diverge.  .-Vllowing  A,;,  to  be  determined  by  the  (loldstein  test 
(Ecp  7d)  yielded  results  similar  to  the  original  algorithm,  but  required  more  computer  time. 
Finally.  Figure  7(a)  shows  that  the  algorithm  based  on  local  linearization  yielded  substantially 
improved  convergence  -the  log  likelihood  converged  after  only  steps. 

Figures  1(b)  and  -1(c)  show  the  evolution  of  the  parameters  for  the  first  expert  net  and 
the  second  expert  net.  respectively.  Comparison  of  these  figures  to  Figures  5(b)  and  5(c) 
shows  that  the  original  algorithm  and  the  modified  line  search  variant  with  Aa  =  1.1  behaved 
almost  identically:  6\  converged  to  the  correct  solution  after  about  IS  steps  in  either  case. 
F'igures  0(b)  and  0(c)  show  the  slowdown  obtained  by  using  Aa  =  0.5.  Figures  7(b)  and  7(c) 
show  the  improved  performance  obtained  using  the  local  linearization  algorithm.  In  this  case, 
the  weight  vectors  converged  to  the  correct  values  within  7  steps. 

Panel  (d)  in  each  of  the  figures  show  the  evolution  of  the  estimated  variances  (jf.rr^.  The 
results  were  similar  to  tho.se  for  the  expert  net  parameters.  .Again,  the  algorithm  ba.sed  on  local 
linearization  yielded  significantly  faster  convergence  than  the  other  algorithms. 

A  .second  simulation  was  run  using  the  following  parameter  values  (see  F'igure  ;{(b)) 

nl  =  O.S.  al  =  0.1.  .VL  =  -1.0.  .it  =  ’2.0.  o',  =  -1.2.  o'^  =  2.4.  .v',  =  1.0.  .v',  =  f-O. 


'fhe  results  obtained  in  this  simulation  were  similar  to  those  obtained  in  the  first  simulation. 
The  EM  algorithm  converged  in  11  steps  and  the  local  linearization  algorithm  converged  in  (i 
steps. 


2-1 
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Figure  5:  The  performance  of  the  linesearch  variant  witli  A^.  = 


variances  parameters  (expert  2)  parameters  (expert  1 )  log  likelihood 


The  results  f'roui  a  iiiiiiiher  of  other  sinmiatioii  experiments  couhrmed  the  results  reixirted 
liere.  lu  general  the  algorithm  based  on  local  linearization  provided  signiticantl\  faster  con¬ 
vergence  than  the  original  KM  algorithm.  I'lie  modilietl  line  s<'arch  variant  did  not  apjx'ar  to 
converge  faster  (if  the  parameter  was  fixed).  We  also  tested  gradient  ascent  in  these  exper¬ 
iments  and  found  that  cotivergence  was  generally  one  to  two  ortlers  of  magnitude  slower  than 
convergence  of  the  KM  algorithm  and  its  variants.  Moreover,  convergence  of  gradient  ascent 
was  rather  sensitive  to  the  learning  rate  and  the  initial  values  of  the  parameters. 


Concluding  remarks 

Kinite  mixture  models  have  become  increasingly  popular  as  models  for  unsupervised  learning, 
pailly  because  they  occupy  an  interesting  niche  between  parametric  and  nonparametric  ap¬ 
proaches  to  statistical  estimation.  Mixture-ba.sed  approaches  are  parametric  in  that  particular 
parametric  forms  must  be  chosen  for  the  component  densities,  but  they  can  also  be  regarded  as 
nonparametric  by  allowing  the  number  of  components  of  the  mixture  to  grow.  The  advantage  of 
this  niche  in  statistical  theory  is  that  the.se  models  have  much  of  the  flexibility  of  nonparametric 
approaches,  but  retain  some  of  the  analytical  advantages  of  parametric  approaches  (Mcl.achlan 
iV  Basford,  19SS).  Similar  remarks  can  be  made  in  the  ca.se  of  supervised  learning:  The  ME 
architecture  and  the  HME  architecture  provide  flexible  models  for  general  nonlinear  regres¬ 
sion  while  retaining  a  strong  flavor  of  parametric  statistics.  The  latter  model,  in  particular, 
compares  favorably  to  decision  tree  models  in  this  regard  (.Jordan  A.’  .Jacobs,  in  press). 

In  the  current  paper  we  have  contributed  to  the  theory  of  mixture- based  supervised  learning. 
We  have  analyzed  an  EM  algorithm  for  ME  and  HME  architectures  and  provided  theorems  on 
the  convergence  of  this  algorithm.  In  particular,  we  have  shown  that  learning  algorithm  can  be 
regarded  as  a  variable  metric  algorithm  with  its  metric  matrix  P  being  positive  definite,  so  that 
the  searching  direction  of  the  algorithm  always  has  a  positive  projection  on  the  gradient  of  the 
log  likelihood.  We  have  shown  that  the  algorithm  converges  linearly,  with  a  rate  determined 
by  the  difference  between  the  minimal  and  maximal  eigenvalues  of  a  negative  definite  matrix. 

Similar  results  to  those  obtained  here  can  also  be  obtained  for  the  case  of  the  unsupervised 
learning  of  finite  mixtures  (Xu  tk:  .Jordan.  1993). 
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Appendix 

The  theoretical  results  presented  in  the  main  text  show-  that  the  EM  algorithm  for  the  ME 
and  HME  architectures  converges  linearly  with  a  rale  determined  by  the  condition  number  of  a 
particular  matrix.  These  results  were  obtained  for  a  special  case  in  which  the  expert  networks 
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are  linear  vvitli  a  e,aiissiaii  ()i(.)[)al)ility  uuxlel  and  tlie  gating  iietw(jrks  ar<'  innltinoniial  logit 
models.  In  tliis  section  we  discuss  e.xtensions  of  tliese  results  to  other  architectures. 

VVe  first  note  that  Theorems  2  and  1  make  no  specific  relVuence  to  the  particular  prohahility 
models  utilized  in  specifying  the  architecture.  The  results  oji  convergeme  rate*  in  tliese  tlu'orems 
re(|uire  only  that  the  matri.x  F  he  positive  definite.  I'liese  theorems  apply  directly  to  other 
architectures  if  the  corresponding  matrices  can  lie  shown  to  lx*  positive  definite.  W’e  therefore 
iu»ed  only  consider  generalizations  of  Theorem  1.  the  theorem  which  established  the  positive 
definiteness  of  F  for  the  generalized  linear  MK  architect  tires.  .An  analogous  generalization  of 
Theorem  3  for  the  HME  architectures  can  also  be  obtained. 

Let  us  consider  the  case  in  which  the  function  implemented  by  each  e.xpert  network 
is  nonlinear  in  the  parameters.  V\'e  consider  two  possible  updates  for  the  param¬ 
eters:  (1)  a  gradient  algorithm 
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and  (2)  a  Newton  algorithm; 
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where  ')j  >  0  is  a  learuiug  rate. 

These  updates  are  covered  by  the  following  e.xtension  of  Theorem  1. 


Theorem  lA  1  For  the  model  given  by  Eej.  (4)  eind  the  vpeleite»  given  by  Eg.  (8.i)  or  Eg. 
(85).  we  have: 
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where 


ks  positive  elejinite. 


Proof.  For  the  gradient  descent  algorithm,  we  have  which  is  obviously  positive 

definite  because  7^  >  0.  For  the  Newton  algorithm,  we  have  that  P-^*  =  We  now 

show  that  this  matrix  is  positive  definite.  For  an  arbitrary  vector  u.  we  have 
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Etiuality  liolds  only  when  v  =  u 


—  11  '  - * - : - - - 


=  0.  since  is  positive  definite  with  piol)al)ility 


one.  rids  is  itiipossil)le  for  any  u.  So  with  prohahility  one.  (and  thus  ‘  also)  is 

positive  deliidte.  □ 

-Note  that  tlie  Newton  uiidate  ( Kcp  K.j)  is  particularly  ap|)ro[)riate  for  the  case  in  w  hich  the 
experts  are  geiHualized  litiear  tiiodels  (McCullagh  .N'  .N’elder.  19n;{):  tliat  is.  the  <ase  in  which 
6, )  =  [fji(x^'K  0, ).  •  •  • .  0j )]  (dfi  is  the  diniension  of  y )  w  it  h 


/„(x''>.0,)  =  x<'>  +  0,„.  +  ,). 


where  tt  continuous  univariate  nonlinear  function  known  as  the  link-  futiction.  In  this 

case  the  Newton  algorithm  reduces  to  the  IRLS  algorithm.  The  extension  to  generalized  linear 
tnodels  also  allows  probability  tiiodels  from  the  generalized  exponential  family  (cf.  .Jordan  F 
Jacobs,  in  press)  and  Theorem  l.\  is  applicable  to  this  case  as  well. 

W  e  can  also  consider  the  ca.se  in  which  the  gating  network  is  nonlinear  in  the  parameters. 
Both  the  Newton  update  (IRLS  update)  and  the  gradient  update  are  applicable  in  this  case. 
Theorem  I  already  established  that  the  .Newton  update  for  the  gating  network  involves  a  positive 
definite  /y  matrix.  .\s  in  Theorem  l.\.  the  result  for  the  gradient  update  is  immediate. 
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